feat(search): add Google Gemini embedding provider #27974
Adds a fourth embedding provider (google) alongside openai/bedrock/djl, using the Generative Language API with a single API key. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
7 tasks covering schema change + regen, client implementation, validation tests, error path tests, request shape tests, switch wiring, and final verification. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ient The string "models/" appeared in both DEFAULT_BASE_URL and the buildRequestBody method. Extract it as a named constant per project standards. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ound Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…shape Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…t comment Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pull request overview
Adds a new google embedding provider (Google Gemini / Generative Language API) to OpenMetadata’s vector search embedding client framework, alongside the existing bedrock, openai, and djl providers.
Changes:
- Extended the ElasticSearch configuration schema with a new `naturalLanguageSearch.google` block and updated provider description text.
- Implemented `GoogleEmbeddingClient` (HTTP call, request/response JSON handling, error extraction, endpoint override support).
- Wired the new provider into `SearchRepository.createEmbeddingClient` and added a dedicated unit test suite.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| openmetadata-spec/src/main/resources/json/schema/configuration/elasticSearchConfiguration.json | Adds google provider config block under naturalLanguageSearch and updates provider description. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/vector/client/GoogleEmbeddingClient.java | New embedding client implementation for Gemini (Generative Language API). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchRepository.java | Adds google case to embedding client provider switch. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/vector/client/GoogleEmbeddingClientTest.java | Adds unit tests for Google embedding client behavior and request construction. |
| docs/superpowers/specs/2026-05-07-google-gemini-embedding-client-design.md | Design spec documenting the provider, config shape, and behavior. |
| docs/superpowers/plans/2026-05-07-google-gemini-embedding-client.md | Implementation plan and step-by-step checklist for the change. |

    private HttpRequest buildRequest(String body) {
      String encodedKey = URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
      String url = endpoint + "?key=" + encodedKey;
      return HttpRequest.newBuilder()
          .uri(URI.create(url))
          .header("Content-Type", "application/json")
          .timeout(Duration.ofSeconds(30))
          .POST(HttpRequest.BodyPublishers.ofString(body))
          .build();
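
The excerpt above appends `?key=` unconditionally, which produces a malformed URL when the configured endpoint already carries a query string. A minimal sketch of the separator logic that a later commit in this PR describes is shown below; the class and method names (`ApiKeyUrlSketch`, `withApiKey`) and the `example.invalid` host are hypothetical, not taken from the PR.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper illustrating the '?' vs '&' choice; not the PR's actual code.
final class ApiKeyUrlSketch {
  private ApiKeyUrlSketch() {}

  /** Appends key=<apiKey>, using '&' if the endpoint already has a query string and '?' otherwise. */
  static URI withApiKey(String endpoint, String apiKey) {
    // URL-encode the key so reserved characters cannot corrupt the query string.
    String encodedKey = URLEncoder.encode(apiKey, StandardCharsets.UTF_8);
    char separator = endpoint.contains("?") ? '&' : '?';
    return URI.create(endpoint + separator + "key=" + encodedKey);
  }

  public static void main(String[] args) {
    String base = "https://example.invalid/v1beta/models/m:embedContent";
    System.out.println(withApiKey(base, "abc 123"));
    System.out.println(withApiKey(base + "?alt=json", "abc 123"));
  }
}
```

The sketch only checks for the presence of `?`; an endpoint ending in a bare `?` or `&` would need extra handling, which is left out here.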

      "default": 768
    },
    "endpoint": {
      "description": "Custom endpoint URL. Leave empty for the default Generative Language API.",
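
Because the `endpoint` description above does not say whether a base URL or a full URL is expected, a later commit in this PR validates at construction that a custom endpoint contains `:embedContent`. The following is a rough sketch of that kind of fail-fast check, assuming the default URL is built from a `v1beta/models/` base; the names `EndpointValidationSketch` and `resolveEndpoint` and the exact error wording are hypothetical.

```java
// Hypothetical sketch of fail-fast endpoint validation; the real client may differ.
final class EndpointValidationSketch {
  // Assumed default base; the PR only states that the constant contains "models/".
  static final String DEFAULT_BASE_URL =
      "https://generativelanguage.googleapis.com/v1beta/models/";
  static final String EMBED_ACTION = ":embedContent";

  private EndpointValidationSketch() {}

  /** Returns the URL to call, rejecting custom endpoints that lack the :embedContent action. */
  static String resolveEndpoint(String configuredEndpoint, String modelId) {
    if (configuredEndpoint == null || configuredEndpoint.isBlank()) {
      // No override configured: fall back to the default Generative Language API URL.
      return DEFAULT_BASE_URL + modelId + EMBED_ACTION;
    }
    if (!configuredEndpoint.contains(EMBED_ACTION)) {
      throw new IllegalArgumentException(
          "google.endpoint must be the full URL ending in '" + EMBED_ACTION
              + "', for example " + DEFAULT_BASE_URL + modelId + EMBED_ACTION);
    }
    return configuredEndpoint;
  }
}
```

Failing in the constructor surfaces the misconfiguration in startup logs rather than as 404s on the first embed call.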

    java.util.concurrent.atomic.AtomicReference<String> captured =
        new java.util.concurrent.atomic.AtomicReference<>();
    java.util.concurrent.atomic.AtomicReference<Throwable> failure =
        new java.util.concurrent.atomic.AtomicReference<>();
    request
        .bodyPublisher()
        .ifPresent(
            publisher -> {
              java.util.concurrent.Flow.Subscriber<java.nio.ByteBuffer> subscriber =
                  new java.util.concurrent.Flow.Subscriber<>() {
                    private final java.io.ByteArrayOutputStream out =
                        new java.io.ByteArrayOutputStream();

                    @Override
                    public void onSubscribe(java.util.concurrent.Flow.Subscription subscription) {
                      subscription.request(Long.MAX_VALUE);
                    }

                    @Override
                    public void onNext(java.nio.ByteBuffer item) {
                      byte[] arr = new byte[item.remaining()];
                      item.get(arr);
                      out.write(arr, 0, arr.length);
                    }

                    @Override
                    public void onError(Throwable throwable) {
                      failure.set(throwable);
                    }

                    @Override
                    public void onComplete() {
                      captured.set(out.toString(java.nio.charset.StandardCharsets.UTF_8));
                    }
                  };
              publisher.subscribe(subscriber);
            });
    if (failure.get() != null) {
      throw new RuntimeException("Body publisher failed", failure.get());
    }
    String body = captured.get();
    if (body == null) {
      throw new IllegalStateException("Request had no body publisher");
    }
    return body;
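
The helper above reads `captured` immediately after `subscribe`, so it only works if the publisher delivers synchronously. A later commit in this PR describes awaiting `onComplete` through a `CompletableFuture` with a 5-second timeout instead. One way that could look is sketched below; the class name `BodyCaptureSketch` and the exception wrapping are assumptions, not the PR's actual test code.

```java
import java.io.ByteArrayOutputStream;
import java.net.http.HttpRequest;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Flow;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical test helper; a sketch of the CompletableFuture approach, not the PR's code.
final class BodyCaptureSketch {
  private BodyCaptureSketch() {}

  /** Drains the request's body publisher and returns the body as a UTF-8 string. */
  static String extractBody(HttpRequest request) {
    HttpRequest.BodyPublisher publisher =
        request
            .bodyPublisher()
            .orElseThrow(() -> new IllegalStateException("Request had no body publisher"));
    CompletableFuture<String> result = new CompletableFuture<>();
    publisher.subscribe(
        new Flow.Subscriber<ByteBuffer>() {
          private final ByteArrayOutputStream out = new ByteArrayOutputStream();

          @Override
          public void onSubscribe(Flow.Subscription subscription) {
            subscription.request(Long.MAX_VALUE); // ask for every chunk at once
          }

          @Override
          public void onNext(ByteBuffer item) {
            byte[] chunk = new byte[item.remaining()];
            item.get(chunk);
            out.write(chunk, 0, chunk.length);
          }

          @Override
          public void onError(Throwable throwable) {
            result.completeExceptionally(throwable); // surfaces through get() below
          }

          @Override
          public void onComplete() {
            result.complete(out.toString(StandardCharsets.UTF_8));
          }
        });
    try {
      // Block until onComplete or onError fires, whether delivery is sync or async.
      return result.get(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted while reading request body", e);
    } catch (ExecutionException | TimeoutException e) {
      throw new IllegalStateException("Body publisher failed", e);
    }
  }
}
```

Completing the future from `onError` means a publisher failure reaches the test as an exception rather than being silently stored.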
✅ TypeScript Types Auto-Updated
The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.
🔴 Playwright Results: 1 failure(s), 17 flaky
✅ 4067 passed · ❌ 1 failed · 🟡 17 flaky · ⏭️ 86 skipped
Genuine Failures (failed on all attempts) ❌
These were workflow scaffolding (design spec + implementation plan) generated by the superpowers brainstorming/planning flow; they belong in the local development trail, not the PR. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- GoogleEmbeddingClient.buildRequest: handle endpoint with existing query string by switching the key separator from '?' to '&' as needed; document why the API key travels in the URL (Google Generative Language API requirement, not Bearer-header).
- GoogleEmbeddingClient.extractErrorMessage: replace empty catch block with a trace-level log to comply with the 'no empty catch' standard.
- elasticSearchConfiguration.json: clarify google.endpoint description so operators know it must be the full ':embedContent' URL, not a base URL.
- GoogleEmbeddingClientTest.extractBody: await onComplete via CompletableFuture.get(5s) instead of relying on synchronous publisher delivery; surface onError properly.
- New test: testEndpointWithExistingQueryStringUsesAmpersand verifies the '?' / '&' separator logic.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…-client
# Conflicts:
#   conf/openmetadata.yaml
#   openmetadata-ui/src/main/resources/ui/src/generated/configuration/elasticSearchConfiguration.ts
#   openmetadata-ui/src/main/resources/ui/src/generated/settings/settings.ts

    "embeddingModelId": {
      "description": "Gemini embedding model identifier (e.g., gemini-embedding-001, text-embedding-004).",
      "type": "string",
      "default": "gemini-embedding-001"
    },

    google:
      apiKey: ${GOOGLE_API_KEY:-""} # API key from Google AI Studio
      embeddingModelId: ${GOOGLE_EMBEDDING_MODEL_ID:-"gemini-embedding-001"}
      embeddingDimension: ${GOOGLE_EMBEDDING_DIMENSION:-768} # Sent as outputDimensionality. gemini-embedding-001 supports 768/1536/3072; text-embedding-004 supports 768.
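
To make the "sent as outputDimensionality" comment above concrete, here is a small sketch of how the configured values could map onto an `embedContent` request body with the `content.parts[].text` shape the PR describes. It builds the JSON by hand only to stay dependency-free; the real client presumably uses a JSON library, and the names `EmbedRequestBodySketch` and `escape`, plus the choice to omit the field when the dimension is null, are assumptions.

```java
// Hypothetical, dependency-free sketch of the request body shape; not the PR's implementation.
final class EmbedRequestBodySketch {
  static final String MODELS_PREFIX = "models/";

  private EmbedRequestBodySketch() {}

  /** Builds {"model":..., "content":{"parts":[{"text":...}]}, "outputDimensionality":...}. */
  static String buildRequestBody(String modelId, String text, Integer embeddingDimension) {
    StringBuilder json = new StringBuilder();
    json.append("{\"model\":\"").append(escape(MODELS_PREFIX + modelId)).append("\",");
    json.append("\"content\":{\"parts\":[{\"text\":\"").append(escape(text)).append("\"}]}");
    if (embeddingDimension != null) {
      // Assumption: omit the field when no dimension is configured.
      json.append(",\"outputDimensionality\":").append(embeddingDimension);
    }
    return json.append('}').toString();
  }

  /** Minimal JSON string escaping for quotes, backslashes, and control characters. */
  static String escape(String value) {
    StringBuilder sb = new StringBuilder(value.length());
    for (int i = 0; i < value.length(); i++) {
      char c = value.charAt(i);
      switch (c) {
        case '"': sb.append("\\\""); break;
        case '\\': sb.append("\\\\"); break;
        case '\n': sb.append("\\n"); break;
        case '\r': sb.append("\\r"); break;
        case '\t': sb.append("\\t"); break;
        default:
          if (c < 0x20) {
            sb.append(String.format("\\u%04x", (int) c));
          } else {
            sb.append(c);
          }
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(buildRequestBody("gemini-embedding-001", "hello", 768));
  }
}
```

With the defaults from the YAML excerpt, `main` prints `{"model":"models/gemini-embedding-001","content":{"parts":[{"text":"hello"}]},"outputDimensionality":768}`.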
Code Review ✅ Approved · 3 resolved / 3 findings
Integrates Google Gemini as an embedding provider with support for flexible model selection and comprehensive input validation. Issues regarding empty error handling and null pointer risks in the repository configuration have been resolved.
✅ 3 resolved
- ✅ Quality: Empty catch block in extractErrorMessage violates guidelines
- ✅ Edge Case: NPE if google config is null in SystemRepository switch
- ✅ Security: API key in URL query string may leak via server access logs
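
For the null-config finding, the fix described in this PR is an explicit null check before the diagnostic reads the `google` block. A minimal sketch of that guard follows; the record types stand in for the generated configuration classes and, like the message wording and the name `describeGoogle`, are hypothetical.

```java
// Hypothetical stand-ins for the generated config classes; the real types differ.
final class EmbeddingDiagnosticSketch {
  record GoogleConfig(String embeddingModelId, String endpoint) {}

  record NlpConfig(String embeddingProvider, GoogleConfig google) {}

  private EmbeddingDiagnosticSketch() {}

  /** Builds a status message for the google provider without dereferencing a missing block. */
  static String describeGoogle(NlpConfig nlpConfig) {
    GoogleConfig google = nlpConfig.google();
    if (google == null) {
      // Guard: provider selected but no google block configured.
      return "Embedding provider is 'google' but the google configuration block is missing";
    }
    String endpoint =
        (google.endpoint() == null || google.endpoint().isBlank())
            ? "default endpoint"
            : google.endpoint();
    return "Google embedding model '" + google.embeddingModelId() + "' via " + endpoint;
  }
}
```

Returning a specific message here avoids the generic "Unable to determine embedding configuration" fallback that an NPE would otherwise trigger.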
* Add design: Google Gemini embedding client
  Adds a fourth embedding provider (google) alongside openai/bedrock/djl, using the Generative Language API with a single API key.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Add implementation plan: Google Gemini embedding client
  7 tasks covering schema change + regen, client implementation, validation tests, error path tests, request shape tests, switch wiring, and final verification.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(spec): add google embedding provider config block
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* feat(search): add GoogleEmbeddingClient with happy-path test
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* refactor(search): extract MODELS_PREFIX constant in GoogleEmbeddingClient
  The string "models/" appeared in both DEFAULT_BASE_URL and the buildRequestBody method. Extract it as a named constant per project standards.
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(search): add constructor validation tests for GoogleEmbeddingClient
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(search): add blank model id test and clarify null-modelId workaround
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(search): add HTTP error and malformed response tests for GoogleEmbeddingClient
* test(search): tighten empty values array assertion to check message
  Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* test(search): verify Google embedding request URL, headers, and body shape
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* test(search): extract endpoint constant and harden extractBody helper
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(search): wire google embedding provider into SearchRepository switch
* test(search): cover null dimension and custom endpoint, drop redundant comment
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Update generated TypeScript types
* Remove internal planning docs from PR
  These were workflow scaffolding (design spec + implementation plan) generated by the superpowers brainstorming/planning flow; they belong in the local development trail, not the PR.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Address PR review comments
  - GoogleEmbeddingClient.buildRequest: handle endpoint with existing query string by switching the key separator from '?' to '&' as needed; document why the API key travels in the URL (Google Generative Language API requirement, not Bearer-header).
  - GoogleEmbeddingClient.extractErrorMessage: replace empty catch block with a trace-level log to comply with the 'no empty catch' standard.
  - elasticSearchConfiguration.json: clarify google.endpoint description so operators know it must be the full ':embedContent' URL, not a base URL.
  - GoogleEmbeddingClientTest.extractBody: await onComplete via CompletableFuture.get(5s) instead of relying on synchronous publisher delivery; surface onError properly.
  - New test: testEndpointWithExistingQueryStringUsesAmpersand verifies the '?' / '&' separator logic.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Update generated TypeScript types
* Wire google embedding provider into openmetadata.yaml defaults
  - Add `google:` block under naturalLanguageSearch with env-var fallbacks (GOOGLE_API_KEY, GOOGLE_EMBEDDING_MODEL_ID, GOOGLE_EMBEDDING_DIMENSION, GOOGLE_API_ENDPOINT).
  - Update embeddingProvider option list comment to include "google".
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Use gemini-embedding-001 default and pass outputDimensionality
  The previous default (text-embedding-004) is rejected on some Google projects with `404: not found for API version v1beta, or is not supported for embedContent`. Switch to gemini-embedding-001, the current GA model, available at v1beta and broadly accessible.
  - GoogleEmbeddingClient.buildRequestBody: include outputDimensionality from the configured embeddingDimension. Required for gemini-embedding-001 (defaults to 3072 dims otherwise) and supported as a truncation hint by text-embedding-004.
  - elasticSearchConfiguration.json + openmetadata.yaml: change default embeddingModelId to gemini-embedding-001 and document the outputDimensionality semantics on the embeddingDimension field.
  - GoogleEmbeddingClientTest.testRequestBodyShape: assert outputDimensionality=768 in the captured body and use gemini-embedding-001 as the test fixture model.
  - SystemRepository.getEmbeddingConfigurationMessage: add a `google` case so /api/v1/system/status surfaces the configured model/endpoint instead of "Unknown provider 'google'".
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Update generated TypeScript types
* Guard against missing google config in SystemRepository diagnostic
  If `embeddingProvider=google` but the `google` config block is absent, calling `nlpConfig.getGoogle().getEndpoint()` would NPE and produce a misleading "Unable to determine embedding configuration" message. Add an explicit null check that yields a clear diagnostic instead.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Validate google.endpoint contains :embedContent at construction
  A custom endpoint missing the `:embedContent` action used to silently produce 404s at runtime. Fail fast at startup with a clear message showing the expected URL form, so misconfiguration surfaces in logs instead of in vector-search failures.
  - Update testCustomEndpointConstruction to use a valid full URL.
  - Add testCustomEndpointWithoutEmbedContentThrows.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* feat(spec): add modelId chat field to google block
  Adds a `modelId` property to the natural-language-search `google` block, parallel to how the `openai` block exposes both `modelId` (chat) and `embeddingModelId` (embedding). This enables Gemini-based NLQ filter extraction (chat completions via :generateContent) on top of the existing embedding support. Default: gemini-2.5-flash.
  Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
* Update generated TypeScript types
* Update generated TypeScript types
* trigger

---------

Co-authored-by: Claude Opus 4.7 <noreply@anthropic.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>



Summary
Adds a fourth embedding provider, Google Gemini via the Generative Language API, alongside the existing `openai`, `bedrock`, and `djl` providers. Operators can now point natural-language / semantic search at Gemini models using a single API key from Google AI Studio (no GCP project, service account, or OAuth setup required).
- `google` block under `naturalLanguageSearch` in `elasticSearchConfiguration.json` with `apiKey`, `embeddingModelId`, `embeddingDimension`, `endpoint`, and `modelId` (the latter parallel to `openai.modelId` for future Gemini-based NLQ chat completions).
- `GoogleEmbeddingClient` mirroring `OpenAIEmbeddingClient`: API key in URL query string (Google API requirement), `content.parts[].text` request body, `embedding.values` response parsing, error-message extraction from Google's standard error envelope. The configured `embeddingDimension` is sent as `outputDimensionality` so the response vector size matches the OpenSearch index shape (required for `gemini-embedding-001`, which defaults to 3072 dims otherwise).
- `SearchRepository.createEmbeddingClient` switch and a `google` case in `SystemRepository.getEmbeddingConfigurationMessage` so `/api/v1/system/status` reports the configured Google model/endpoint.
- `google.endpoint` values are validated at construction to contain `:embedContent`; misconfigurations fail fast at startup with a clear message instead of opaque 404s at first embed call.
- Tests using `HttpClient` stubs (no Mockito), covering construction, validation, success path, HTTP errors, malformed responses, request-shape verification (URL, headers, body, `outputDimensionality`), URL encoding, and custom-endpoint paths.

Defaults to `gemini-embedding-001` / 768 dim. `text-embedding-004` is also supported via configuration. The dimension setting is honored end-to-end via `outputDimensionality`.

Test plan
- `mvn test -pl openmetadata-service -Dtest=GoogleEmbeddingClientTest`: 25 tests pass
- `mvn test -pl openmetadata-service -Dtest='*EmbeddingClientTest'`: sibling `OpenAIEmbeddingClientTest` / `EmbeddingClientTest` pass, no regressions
- `mvn spotless:apply -pl openmetadata-service`: clean
- `/api/v1/system/status` returns Google config diagnostic instead of "Unknown provider 'google'"
- … `semantic_search` tool)

Out of scope
- … `googleVertexAi` provider)
- … `modelId` field into an NLQ chat path: schema field is added; client implementation is a separate PR
- … `:batchEmbedContents`): uses base-class default serial loop, matching siblings

Summary by Gitar
- … `maxConcurrentEmbeddingRequests` to `maxConcurrentRequests` in both configuration schema and `openmetadata.yaml` to support broader request throttling.
- … `Entity.PAGE` case to `deleteOrUpdateChildren` in `SearchRepository` to enable recursive hard-deletion by FQN prefix.
- … `Bedrock` and `OpenAI` providers including `timeoutSeconds`, `maxTokens`, and `temperature`.
- … `filterExtractor` and `hybridSearch` configuration objects to `elasticSearchConfiguration.json` for fine-tuning NLQ and search pipeline parameters.
- … `modelId` field to the `google` configuration block for flexible Gemini model selection.
- … `google.endpoint` contains the required `:embedContent` suffix and added diagnostic guards for system configuration.